Response endpoint selection

ABSTRACT

A computing system has multiple endpoint computing devices in local environments to receive verbal requests from various users and a central or remote system to process the requests. The remote system generates responses and uses a variety of techniques to determine where and when to return responses audibly to the users. For each request, the remote system understands who is making the request, determines when to provide the response to the user, ascertains where the user is when it is time to deliver the response, discovers which of the endpoint devices are available to deliver the response, and evaluates which of the available devices is best to deliver the response. The system then delivers the response to the best endpoint device for audible emission or other form of presentation to the user.

This application is a continuation of, and claims priority to, U.S. patent application Ser. No. 15/049,917, entitled “RESPONSE ENDPOINT SELECTION BASED ON USER PROXIMITY DETERMINATION”, filed on Feb. 22, 2016, and scheduled to issue as U.S. Pat. No. 10,778,778, and which is a continuation of, and claims priority to, U.S. patent application Ser. No. 13/715,741, entitled “RESPONSE ENDPOINT SELECTION”, filed on Dec. 14, 2012, and issued as U.S. Pat. No. 9,271,111, each of which is incorporated by reference herein in its entirety.

BACKGROUND

Homes, offices and other places are becoming more connected with the proliferation of computing devices such as desktops, tablets, entertainment systems, and portable communication devices. As these computing devices evolve, many different ways have been introduced to allow users to interact with computing devices, such as through mechanical devices (e.g., keyboards, mice, etc.), touch screens, motion, gesture, and even through natural language input such as speech.

As computing devices evolve, users are expected to rely more and more on such devices to assist them in routine tasks. Today, it is commonplace for computing devices to help people buy tickets, shop for goods and services, check the weather, find and play entertainment, and so forth. However, with the growing ubiquity of computing devices, it is not uncommon for users to have many devices, such as a smartphone, e-book reader, a tablet, a computer, an entertainment system, and so forth. One of the challenges for multi-device users is how to perform tasks effectively when working with multiple devices. Coordinating a task among multiple devices is non-trivial.

Accordingly, there is a need for techniques to improve coordination of user activity in a ubiquitous computing device environment.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical components or features.

FIG. 1 illustrates an environment in which multiple computing devices, including voice controlled devices, are ubiquitous and coordinated to assist a person in handling routine tasks.

FIG. 2 shows a representative scenario of a person using the computing environment to assist with the task. FIG. 2 includes a functional block diagram of select components of computing devices in the environment as well as remote cloud services accessible via a network.

FIG. 3 shows how devices are selected to engage the person during performance of the task.

FIG. 4 shows a block diagram of selected components of computing devices that may be used in the environment.

FIG. 5 is a flow diagram showing an illustrative process for aiding the person in performing a task, including receiving a request from the person via one device and delivering a response to the person via another device.

FIG. 6 is a flow diagram showing an illustrative process for determining a location of the person.

FIG. 7 is a flow diagram showing an illustrative process for determining a device to which to deliver the response to the person.

DETAILED DESCRIPTION

Described herein are techniques to leverage various computing devices to assist in routine tasks. As computing devices become ubiquitous in homes, offices, and other places, users are less likely to differentiate among them when thinking about and performing these routine tasks. The users will increasingly expect the devices to intelligently help, regardless of where the users are located and what the users might currently be doing. To implement this intelligence, a computing system is architected to organize task management across multiple devices with which the user may interact.

In one implementation, the computing system is constructed as a cloud service that uses a variety of implicit and explicit signals to determine presence of a user in a location and to decide which, if any, assistance or responses to provide to one or more devices within that location. The signals may represent any number of indicia that can help ascertain the whereabouts of the user and how best to interact with the person at that time, and at that location. Representative signals may include audio input (e.g., sound of a user's voice), how recently the user interacted with a device, presence of a mobile device associated with the user, visual recognition of the user, and so forth.

As one example scenario, suppose a user wants to remember to do a simple household chore or work task. The user may ask the computing system, via a first device, to remind him at a future time to do the household chore or work task. The computing system may then subsequently, at the future time, remind the user via a second device that is appropriate in the current circumstances to deliver that message. In this case, the computing system understands who is making the request, determines when to provide the reminder to the user, ascertains where the user is when it is time to remind him, discovers which devices are available to deliver the reminder, and evaluates which of the available devices is best to deliver the reminder. In this manner, the computing system implements response functionality that includes intelligent selection of endpoint devices.

The various operations to implement this intelligence may be split among local devices and remote cloud computing systems. In various implementations, different modules and functionality may reside locally in the devices proximal to the user, or remotely in the cloud servers. This disclosure provides one example implementation in which a significant portion of the response system resides in the remote cloud computing system.

Further, this disclosure describes the techniques in the context of local computing devices that are primarily voice operated, such as dedicated voice controlled devices. Receiving verbal requests and providing audible responses introduce some additional challenges, which the system described below is configured to address. However, use of voice controlled devices is not intended to be limiting as other forms of engaging the user (e.g., gesture input, typed input, visual output, etc.) may be used by the computing system.

Illustrative Architecture

FIG. 1 shows an illustrative architecture of a computing system 100 that implements response functionality with intelligent endpoint selection. For discussion purposes, the system 100 is described in the context of users going about their normal routines and interacting with the computing system 100 throughout the day. The computing system 100 is configured to receive requests given by users at respective times and locations, process those requests, and return responses at other respective times, to locations at which the users are present, and to appropriate endpoint devices.

In this illustration, a house 102 is a primary residence for a family of three users, including a first user 104 (e.g., adult male, dad, husband, etc.), a second user 106 (e.g., adult female, mom, wife, etc.), and a third user 108 (e.g., daughter, child, girl, etc.). The house is shown with five rooms including a master bedroom 110, a bathroom 112, a child's bedroom 114, a living room 116, and a kitchen 118. The users 104-108 are located in different rooms in the house 102, with the first user 104 in the master bedroom 110, the second user 106 in the living room 116, and the third user 108 in the child's bedroom 114.

The computing system 100 includes multiple local devices or endpoint devices 120(1), . . . , 120(N) positioned at various locations to interact with the users. These devices may take on any number of form factors, such as laptops, electronic book (eBook) reader devices, tablets, desktop computers, smartphones, voice controlled devices, entertainment devices, augmented reality systems, and so forth. In FIG. 1, the local devices include a voice controlled device 120(1) residing in the bedroom 110, a voice controlled device 120(2) in the child's bedroom 114, a voice controlled device 120(3) in the living room 116, a laptop 120(4) in the living room 116, and a voice controlled device 120(5) in the kitchen 118. Other types of local devices may also be leveraged by the computing system, such as a smartphone 120(6) of the first user 104, cameras 120(7) and 120(8), and a television screen 120(9). In addition, the computing system 100 may rely on other user-side devices found outside the home, such as in an automobile 122 (e.g., car phone, navigation system, etc.) or at the first user's office 124 (e.g., work computer, tablet, etc.) to convey information to the user.

Each of these endpoint devices 120(1)-(N) may receive input from a user and deliver responses to the same user or different users. The input may be received in any number of ways, including as audio or verbal input, gesture input, and so forth. The responses may also be delivered in any number of forms, including as audio output, visual output (e.g., pictures, UIs, videos, etc. depicted on the laptop 120(4) or television 120(9)), haptic feedback (e.g., vibration of the smartphone 120(6), etc.), and the like.

The computing system 100 further includes a remote computing system, such as cloud services 130 supported by a collection of network-accessible devices or servers 132. The cloud services 130 generally refer to a network-accessible platform implemented as a computing infrastructure of processors, storage, software, data access, and so forth that is maintained and accessible via a network, such as the Internet. Cloud services 130 may not require end-user knowledge of the physical location and configuration of the system that delivers the services. Common expressions associated with cloud services include “on-demand computing”, “software as a service (SaaS)”, “platform computing”, “network accessible platform”, and so forth.

The cloud services 130 coordinate request input and response output among the various local devices 120(1)-(N). At any one of the local devices 120(1)-(N), a user, such as the user 104, may enter a request for the computing system 100 to handle. This request may be a verbal request, such as the user 104 speaking to the voice controlled device 120(1) in the master bedroom 110. For instance, the user may say, “Please remind me to take out the garbage tomorrow morning.” The voice controlled device 120(1) is equipped with microphones to receive the audio input and a network interface to pass the request to the cloud services 130. The local device 120(1) may optionally have natural language processing functionality to begin processing of the speech content.

The request is passed to the cloud services 130 over a network (not shown in FIG. 1) where the request is processed. The request is parsed and interpreted. In this example, the cloud services 130 determine that the user wishes to be reminded of the household chore to take out the garbage at a specified timeframe (i.e., tomorrow morning). The cloud services 130 implement a task handler to define a task that schedules a reminder to be delivered to the user at the appropriate time (e.g., 7:00 AM). When that time arrives, the cloud services 130 determine where the target user who made the request, i.e., the first user 104, is located. The cloud services 130 may use any number of techniques to ascertain the user's whereabouts, such as polling devices in the area to get an audio, visual, or other biometric confirmation of presence, or locating a device that might be personal or associated with the user (e.g., smartphone 120(6)), or through other secondary indicia, such as the user's history of activity, receipt of other input from the user from a specific location, and so forth.

Once the user is located, the cloud services 130 may then determine which local device is suitable to deliver the response to the user. In some cases, there may be only a single device and hence the decision is straightforward. However, in other situations, the user may be located in an area having multiple local devices, any one of which may be used to convey the response. In such situations, the cloud services 130 may evaluate the various candidate devices, and select the best or most appropriate device in the circumstances to deliver the response.

In this manner, the computing system 100 provides a coordinated response system that utilizes ubiquitous devices available in the user's environment to receive requests and deliver responses. The endpoint devices used for receipt of the request and delivery of the response may be different. Moreover, the devices need not be associated with the user in any way, but rather may be generic endpoint devices that are used as needed to interact with the user. To illustrate the flexibility of the computing system, the following discussion continues the earlier example of a user asking to be reminded to perform a household chore.

FIG. 2 illustrates select devices in the computing system 100 to show a representative scenario of a person using the computing environment to assist with the task. In this example, two endpoint devices are shown, with a first endpoint device in the form of the voice controlled assistant 120(1) residing in the bedroom 110 and the second endpoint device in the form of the voice controlled assistant 120(5) residing in the kitchen 118. The endpoint devices 120(1) and 120(5) are coupled to communicate with the remote cloud services 130 via a network 202. The network 202 may be representative of any number of network types, such as wired networks (e.g., cable, LAN, etc.) and/or wireless networks (e.g., Bluetooth, RF, cellular, satellite, etc.).

Each endpoint or local device, as represented by the bedroom-based device 120(1), is equipped with one or more processors 204, computer-readable media 206, one or more microphones 208, and a network interface 210. The computer-readable media 206 may include volatile and nonvolatile memory, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules, or other data.

Local program modules 212 are shown stored in the media 206 for execution by the processor(s) 204. The local modules 212 provide basic functionality to receive and process audio input received via the microphones 208. The functionality may include filtering signals, analog-to-digital conversion, parsing sounds or words, and early analysis of the parsed sounds or words. For instance, the local modules 212 may include a wake word recognition module to recognize wake words that are used to transition the voice controlled assistant 120(1) to an awake state for receiving input from the user. The local modules 212 may further include some natural language processing functionality to begin interpreting the voice input from the user.

To continue the above example, suppose the user 104 makes a request to the voice controlled assistant 120(1) in the bedroom 110 at a first time of 9:30 PM. The request is for a reminder to perform a household chore in the morning. In this example, the user 104 speaks a wake word to alert the device 120(1) and then verbally gives the request, “Remind me to take out the garbage tomorrow morning” as indicated by the dialog bubble 213. The microphone(s) 208 receive the audio input and the local module(s) 212 process and recognize the wake word to initiate other modules. The audio input may be parsed and partially analyzed, and/or packaged and sent via the interface 210 and network 202 to the cloud services 130.

The cloud services 130 include one or more network-accessible devices, such as servers 132. The servers 132 may include one or more processors 214 and computer-readable media 216. The processor(s) 214 and the computer-readable media 216 of the servers 132 are physically separate from the processor(s) 204 and computer-readable media 206 of the device 120(1), but may function jointly as part of a system that provides processing and memory in part on the device 120 and in part on the cloud services 130. These servers 132 may be arranged in any number of ways, such as server farms, stacks, and the like that are commonly used in data centers.

The servers 132 may store and execute any number of programs, data, applications, and the like to provide services to the user. In this example architecture, the servers 132 are shown to store and execute natural language processing (NLP) modules 218, a task handler 220, a person location module 222, and various applications 224. The NLP modules 218 process the audio content received from the local device 120(1) to interpret the request. If the local device is equipped with at least some NLP capabilities, the NLP modules 218 may take those partial results and complete the processing to interpret the user's verbal request.

The resulting interpretation is passed to the task handler 220 to handle the request. In our example, the NLP modules 218 interpret the user's input as requesting a reminder to be scheduled and delivered at the appropriate time. The task handler 220 defines a task to set a reminder to be delivered at a time period associated with “tomorrow morning”. The task might include the contents (e.g., a reminder to “Don't forget to take out the garbage”), a time for delivery, and an expected location of delivery. The delivery time and expected location may be ascertained from secondary indicia that the cloud services 130 aggregate and search. For instance, the task handler 220 may consult other indicia to better understand what “tomorrow morning” might mean for this particular user 104. One of the applications 224 may be a calendar that shows the user has a meeting at the office at 7:30 AM, and hence is expected to leave the house 102 by 7:00 AM. Accordingly, the task handler 220 may narrow the range of possible times to before 7:00 AM. The task handler 220 may further request activity history from a user profile application (another of the applications 224) to determine whether the user has a normal morning activity. Suppose, for example, that the user has shown a pattern of arising by 6:00 AM and having breakfast around 6:30 AM. From these additional indicia, the task handler 220 may decide an appropriate time to deliver the reminder to be around 6:30 AM on the next day. Separately, the task handler 220 may further deduce that the user is likely to be in the kitchen at 6:30 AM the next day.

From this analysis, the task handler 220 sets a task for this request. In this example, a task is defined to deliver a reminder message at 6:30 AM on the next day to a target user 104 via an endpoint device proximal to the kitchen 118. That is, the task might be structured as including data items of content, date/time, user identity, default endpoint device, and default location. Once the request is understood and a task is properly defined, the cloud services 130 may return a confirmation to the user to be played by the first device 120(1) that received the request while the user is still present. For instance, in response to the request for a reminder 213, the cloud services 130 might send a confirmation to be played by the bedroom device 120(1), such as a statement “Okay Scott, I'll remind you”, as shown by dialog bubble 215. In this manner, the user experience is one of a conversation with a computing system. The user casually makes a request and the system responds in conversation. The statement may optionally include language such as “tomorrow at 6:30 am in the kitchen” to provide confirmation of the intent and an opportunity for the user to correct the system's understanding and plan.
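As a rough illustration of how such a task might be represented, the following sketch defines a minimal reminder task carrying the data items named above (content, date/time, user identity, default endpoint device, and default location). The class name, field names, and identifier values are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ReminderTask:
    """Hypothetical structure for a task assembled by the task handler 220."""
    content: str             # message to deliver, e.g., the garbage reminder
    deliver_at: datetime     # resolved delivery time (e.g., 6:30 AM the next day)
    user_id: str             # identity of the target user
    default_device_id: str   # default endpoint device (e.g., the kitchen assistant)
    default_location: str    # expected location of delivery

# Example: the task defined for the garbage-reminder scenario.
task = ReminderTask(
    content="Don't forget to take out the garbage",
    deliver_at=datetime(2012, 12, 15, 6, 30),
    user_id="user-104",
    default_device_id="device-120-5",
    default_location="kitchen",
)
```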

The person location module 222 may further be used to help locate the user and an appropriate endpoint device when the time comes to deliver the response. Continuing the example, the task handler 220 might instruct the person location module 222 to help confirm a location of the user 104 as the delivery time of 6:30 AM approaches. Initially, the person location module 222 may attempt to locate the user 104 by evaluating a location of a personal device that he carries, such as his smartphone 120(6). Using information about the location of the smartphone 120(6) (e.g., GPS, trilateration from cell towers, Wi-Fi base station proximity, etc.), the person location module 222 may be able to confirm that the user is indeed in the house 102. Since the default assumption is that the user will be in the kitchen 118, the person location module 222 may ask the local device 120(5) to confirm that the target user 104 is in the kitchen 118. In one implementation, the person location module 222 may direct the local device 120(5) to listen for voices and then attempt to confirm that one of them is the target user 104. For instance, the local device 120(5) may provide a greeting to the target user, using the user's name, such as “Good morning Scott” as indicated by dialog bubble 226. If the target user 104 is present, the user may answer “Good morning”, as indicated by the dialog bubble 228. In an alternative implementation, the local device 120(5) may be equipped with voice recognition functionality to identify the target user by capturing his voice in the environment. As still another implementation, the person location module 222 may request a visual image from the camera 120(8) (see FIG. 1) in the kitchen to get a visual confirmation that the target user 104 is in the kitchen.
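A minimal sketch of the greet-and-listen confirmation just described is shown below. The three injected callables stand in for device and voice-profile interfaces that the disclosure does not define; they are assumptions made only for illustration.

```python
def confirm_presence(play_prompt, listen, voice_matches_target,
                     greeting="Good morning Scott"):
    """Hypothetical presence check built from three injected callables:
    play_prompt(text) speaks a greeting, listen() returns captured audio or
    None, and voice_matches_target(audio) compares the reply against the
    target user's stored voice profile. All three are illustrative stand-ins.
    """
    play_prompt(greeting)      # e.g., dialog bubble 226: "Good morning Scott"
    reply_audio = listen()     # e.g., dialog bubble 228: "Good morning"
    if reply_audio is None:
        return False           # nobody answered; fall back to other signals
    return voice_matches_target(reply_audio)
```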

When the delivery time arrives, the task handler 220 engages an endpoint device to deliver the response. In this example, the task handler 220 contacts the voice controlled assistant 120(5) in the kitchen 118 to send the response. The content from the reminder task is extracted and sent to the device 120(5) for playback over the speaker. Here, at 6:30 AM, the voice controlled assistant audibly emits the reminder, “Don't forget to take out the garbage” as indicated by the dialog bubble 230.

As illustrated by this example, the computing system 100 is capable of receiving user input from one endpoint or local device 120, processing the user input, and providing a timely response via another endpoint or local device 120. The user need not remember to which device he gave the request, or specify at which device he receives the response. Indeed, it might be any number of devices. Instead, the user experience is enhanced by the ubiquity of the devices, and the user will merely assume that the computer-enabled assistant system intuitively listened to the request and provided a timely response.

In some situations, there may be multiple devices to choose from when delivering the reminder. In this situation, the cloud services 130 may evaluate the various devices to find a best fit for the circumstances. Accordingly, one of the applications 224 may be an endpoint device selection module that attempts to identify the best local endpoint device for engaging the user. One example scenario is provided next to illustrate possible techniques for ascertaining the best device.

FIG. 3 shows how local endpoint devices are selected to engage the target person during performance of the task. In this illustration, four local endpoint devices 302, 304, 306, and 308 are shown in four areas or zones A-D, respectively. The zones A-D may represent different rooms, physical areas of a larger room, and so forth. In this example, the target user 104 is in zone D. But, he is not alone. In addition, four other people are shown in the same zone D.

An endpoint device selector 310 is shown stored in the computer-readable media 216 for execution on the processor(s) 214. The endpoint device selector 310 is configured to identify available devices to engage the user 104, and then analyze them to ascertain the most appropriate device in the circumstances. Suppose, for discussion purposes, that any one of the four devices 302-308 may be identified as “available” devices that are sufficiently proximal to communicate with the user 104. There are many ways to determine available devices, such as detecting devices known to be physically in or near areas proximal to the user, finding devices that pick up audio input from the user (e.g., casual conversation in a room), devices associated with the user, user preferences, and so forth.

The endpoint device selector 310 next evaluates which of the available devices is most appropriate under the circumstances. There are several ways to make this evaluation. In one approach, a distance analysis may be performed to determine the distances between a device and the target person. As shown in FIG. 3, the voice controlled assistant 308 is physically closest to the target user 104 at a distance D1 and the voice controlled assistant 306 is next closest at a distance D2. Using distance, the endpoint device selector 310 may choose the closest voice controlled assistant 308 to deliver the response. However, physical proximity may not be the best in all circumstances.

Accordingly, in another approach, audio characteristics in the environment surrounding the user 104 may be analyzed. For instance, the signal-to-noise ratios are measured at various endpoint devices 302-308 to ascertain which one is best at hearing the user to the exclusion of other noise. As an alternative, the background volume may be analyzed to determine whether the user is in an area of significant background noise, such as the result of a conversation of many people or background audio from a television or appliance. Still another possibility is to analyze echo characteristics of the area, as well as perhaps evaluate Doppler characteristics that might be introduced as the user is moving throughout one or more areas. That is, verbal commands from the user may reach different devices with more or less clarity and strength depending upon the movement and orientation of the user.
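One way to picture the signal-to-noise comparison is to estimate, for each candidate device, an SNR from frames captured while the user is speaking versus frames of background alone. The sketch below assumes each device supplies lists of per-frame RMS energies; the function name and example values are illustrative only.

```python
import math

def snr_db(speech_rms_frames, background_rms_frames):
    """Estimate a signal-to-noise ratio in decibels from per-frame RMS energies.

    speech_rms_frames: RMS values captured while the target user speaks.
    background_rms_frames: RMS values captured while the user is silent.
    Both inputs are assumed to be non-empty lists of positive floats.
    """
    signal = sum(speech_rms_frames) / len(speech_rms_frames)
    noise = sum(background_rms_frames) / len(background_rms_frames)
    return 20.0 * math.log10(signal / noise)

# Illustrative comparison of two candidate devices: the one with the higher
# SNR hears the user more clearly relative to background noise.
device_a = snr_db([0.40, 0.38, 0.42], [0.05, 0.06, 0.05])   # quiet corner
device_b = snr_db([0.30, 0.29, 0.31], [0.20, 0.22, 0.21])   # near a television
print(device_a > device_b)   # True: device A would be preferred on this signal
```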

In still another approach, environment observations may be analyzed. For instance, a number of people in the vicinity may be counted based on data from cameras (if any) or recognition of distinctive voices. In yet another situation, a combination of physical proximity, sound volume-based determination, and/or visual observation may indicate that the closest endpoint device is actually physically separated from the target user by a structural impediment (e.g., the device is located on the other side of a wall in an adjacent room). In this case, even though the device is proximally the closest in terms of raw distance, the endpoint device selector 310 removes the device from consideration. These are but a few examples.

Any one or more of these analyses may be performed to evaluate possible endpoint devices. Suppose, for continuing discussion, that the endpoint device selector 310 determines that the noise level and/or number of people in zone D are too high to facilitate effective communication with the target user 104. As a result, instead of choosing the closest voice controlled assistant 308, the endpoint selector 310 may direct the voice controlled assistant 306 in zone C to communicate with the target user 104. In some instances, the assistant 306 may first attempt to get the user's attention by playing a statement to draw the user closer, such as “Scott, I have a reminder for you” as represented by the dialog bubble 312. In reaction to this message, the user 104 may move closer to the device 306 in zone C, thereby shrinking the distance D2 to a more suitable length. For instance, the user 104 may move from a first location in zone D to a new location in zone C as shown by an arrow labeled “scenario A”. Thereafter, the task handler 220 may deliver the reminder to take out the garbage.

In addition, these techniques for identifying the most suitable device for delivering the response may aid in delivery of confidential or sensitive messages. For instance, suppose the target user 104 sets a reminder to pick up an anniversary gift for his wife. In this situation, the endpoint device selector 310 will evaluate the devices in and near the user's current location in an effort to identify a device that can deliver the reminder without the user's wife being present to hear the message. For instance, suppose the user 104 moves from zone D to zone A for a temporary period of time (as illustrated by an arrow labeled “scenario B”), thereby leaving the other people (and his wife) in zone D. Once the user is detected as being alone in zone A, the task handler 220 may direct the voice controlled assistant 302 to deliver the reminder response to the user. This is shown, for example, by the statement “Don't forget to pick up your wife's anniversary present” in dialog bubble 314.

Aspects of the system described herein may be further used to support real time communication between two people. For example, consider a scenario where one user wants to send a message to another user in real time. In this scenario, the first user may provide a message for delivery to the second user. For instance, the first user may speak a message to a first endpoint device, which sends the message to the cloud services for processing. The cloud services may then determine a location of the second user and select a second endpoint device that is available and suitable for delivery of the message to the second user. The message may then be presented to the second user via the second endpoint device.

FIG. 4 shows selected functional components of devices 120(1)-(N) that may be used in the computing environment. As noted in FIG. 1, the devices may be implemented in any number of ways and form factors. In this example, a device may be implemented as a standalone voice controlled device 120(1) that is relatively simple in terms of functional capabilities with limited input/output components, memory, and processing capabilities. For instance, the voice controlled device 120(1) does not have a keyboard, keypad, or other form of mechanical input. Nor does it have a display or touch screen to facilitate visual presentation and user touch input. Instead, the device 120(1) may be implemented with the ability to receive and output audio, a network interface (wireless or wire-based), power, and processing/memory capabilities. In certain implementations, a limited set of one or more input components may be employed (e.g., a dedicated button to initiate a configuration, power on/off, etc.). Nonetheless, the primary and potentially only mode of user interaction with the device 120(1) is through voice input and audible output.

The devices used in the system may also be implemented as a mobile device 120(6) such as a smartphone or personal digital assistant. The mobile device 120(6) may include a touch-sensitive display screen and various buttons for providing input as well as additional functionality such as the ability to send and receive telephone calls. Alternative implementations of the voice controlled device 120 may also include configuration as a computer, such as a laptop 120(4). The computer 120(4) may include a keyboard, a mouse, a display screen, and any other hardware or functionality that is typically found on a desktop, notebook, netbook, or other personal computing devices. The devices are merely examples and not intended to be limiting, as the techniques described in this disclosure may be used in essentially any device that has an ability to recognize speech input.

In the illustrated implementation, each of the devices 120 includes one or more processors 402 and computer-readable media 404. The computer-readable media 404 may include volatile and nonvolatile memory, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules, or other data. Such memory includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, RAID storage systems, or any other medium which can be used to store the desired information and which can be accessed by a computing device. The computer-readable media 404 may be implemented as computer-readable storage media (“CRSM”), which may be any available physical media accessible by the processor(s) 402 to execute instructions stored on the media 404. In one basic implementation, CRSM may include random access memory (“RAM”) and Flash memory. In other implementations, CRSM may include, but is not limited to, read-only memory (“ROM”), electrically erasable programmable read-only memory (“EEPROM”), or any other tangible medium which can be used to store the desired information and which can be accessed by the processor(s) 402.

Several modules such as instructions, datastores, and so forth may be stored within the computer-readable media 404 and configured to execute on the processor(s) 402. A few example functional modules are shown as applications stored in the computer-readable media 404 and executed on the processor(s) 402, although the same functionality may alternatively be implemented in hardware, firmware, or as a system on a chip (SOC).

An operating system module 406 may be configured to manage hardware and services within and coupled to the device 120 for the benefit of other modules. A wake word recognition module 408 and a speech recognition module 410 may employ any number of conventional speech recognition techniques such as use of natural language processing and extensive lexicons to interpret voice input. For example, the speech recognition module 410 may employ general speech recognition techniques and the wake word recognition module may include speech or phrase recognition particular to the wake word. In some implementations, the wake word recognition module 408 may employ a hidden Markov model that represents the wake word itself. This model may be created in advance or on the fly depending on the particular implementation. In some implementations, the speech recognition module 410 may initially be in a passive state in which the speech recognition module 410 does not recognize or respond to speech. While the speech recognition module 410 is passive, the wake word recognition module 408 may recognize or respond to wake words. Once the wake word recognition module 408 recognizes or responds to a wake word, the speech recognition module 410 may enter an active state in which the speech recognition module 410 operates to detect any of the natural language commands for which it is programmed or to which it is capable of responding. In the particular implementation shown in FIG. 4, the wake word recognition module 408 and the speech recognition module 410 are shown as separate modules, whereas in other implementations these modules may be combined.
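The passive/active hand-off described above can be pictured as a small state machine: the speech recognizer stays passive until the wake word module reports a match, then interprets commands before returning to the passive state. The sketch below is only a minimal illustration of that flow; the simple string comparison stands in for the hidden Markov model mentioned above, and the class and parameter names are invented.

```python
class VoiceFrontEnd:
    """Minimal sketch of the passive/active hand-off between a wake word
    recognizer (module 408) and a speech recognizer (module 410)."""

    def __init__(self, wake_word="wake"):
        self.wake_word = wake_word
        self.active = False   # speech recognition starts in the passive state

    def handle_utterance(self, text):
        """Process one utterance; returns a recognized command or None."""
        if not self.active:
            # Passive state: only the wake word recognizer is listening.
            if self.wake_word in text.lower():
                self.active = True   # transition to the active state
            return None
        # Active state: the speech recognizer interprets natural language commands.
        command = text.strip()
        self.active = False          # return to passive after handling the command
        return command

front_end = VoiceFrontEnd()
assert front_end.handle_utterance("remind me later") is None       # ignored while passive
assert front_end.handle_utterance("wake") is None                  # wake word detected
assert front_end.handle_utterance("Remind me to take out the garbage") is not None
```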

Other local modules 412 may also be present on the device, depending upon the implementation and configuration of the device. These modules may include more extensive speech recognition techniques, filters and echo cancellation modules, speaker detection and identification, and so forth.

The voice controlled device 120 may also include a plurality of applications 414 stored in the computer-readable media 404 or otherwise accessible to the device 120. In this implementation, the applications 414 are a music player 416, a movie player 418, a timer 420, and a personal shopper 422. However, the voice controlled device 120 may include any number or type of applications and is not limited to the specific examples shown here. The music player 416 may be configured to play songs or other audio files. The movie player 418 may be configured to play movies or other audio visual media. The timer 420 may be configured to provide the functions of a simple timing device and clock. The personal shopper 422 may be configured to assist a user in purchasing items from web-based merchants.

Datastores may also be stored locally on the media 404, including a content database 424 and one or more user profiles 426 of users that have interacted with the device 120. The content database 424 stores various content that may be played or presented by the device, such as music, books, magazines, videos and so forth. The user profile(s) 426 may include user characteristics, preferences (e.g., user specific wake words), usage history, library information (e.g., music play lists), online purchase history, and other information specific to an individual user.

Generally, the voice controlled device 120 has input devices 428 and output devices 430. The input devices 428 may include a keyboard, keypad, mouse, touch screen, joystick, control buttons, etc. Specifically, one or more microphones 432 may function as input devices to receive audio input, such as user voice input. In some implementations, the input devices 428 may further include a camera to capture images of user gestures. The output devices 430 may include a display, a light element (e.g., LED), a vibrator to create haptic sensations, or the like. Specifically, one or more speakers 434 may function as output devices to output audio sounds.

A user may interact with the device 120 by speaking to it, and the microphone 432 captures the user's speech. The device 120 can communicate back to the user by emitting audible statements through the speaker 434. In this manner, the user can interact with the voice controlled device 120 solely through speech, without use of a keyboard or display.

The voice controlled device 120 might further include a wireless unit 436 coupled to an antenna 438 to facilitate a wireless connection to a network. The wireless unit 436 may implement one or more of various wireless technologies, such as Wi-Fi, Bluetooth, RF, and so on. A USB port 440 may further be provided as part of the device 120 to facilitate a wired connection to a network, or a plug-in network device that communicates with other wireless networks. In addition to the USB port 440, or as an alternative thereto, other forms of wired connections may be employed, such as a broadband connection. In this manner, the wireless unit 436 and USB port 440 form two of many examples of possible interfaces used to connect the device 120 to the network 202 for interacting with the cloud services 130.

Accordingly, when implemented as the primarily-voice-operated device 120(1), there may be no input devices, such as navigation buttons, keypads, joysticks, keyboards, touch screens, and the like, other than the microphone(s) 432. Further, there may be no output such as a display for text or graphical output. The speaker(s) 434 may be the main output device. In one implementation, the voice controlled device 120(1) may include non-input control mechanisms, such as basic volume control button(s) for increasing/decreasing volume, as well as power and reset buttons. There may also be a simple light element (e.g., LED) to indicate a state such as, for example, when power is on.

Accordingly, the device 120(1) may be implemented as an aesthetically appealing device with smooth and rounded surfaces, with one or more apertures for passage of sound waves. The device 120(1) may merely have a power cord and optionally a wired interface (e.g., broadband, USB, etc.). Once plugged in, the device may automatically self-configure, or do so with slight aid from the user, and be ready to use. As a result, the device 120(1) may generally be produced at a low cost. In other implementations, other I/O components may be added to this basic model, such as specialty buttons, a keypad, display, and the like.

Illustrative Processes

FIG. 5 shows an example process 500 for aiding a person in performing a task, including receiving a request from the person via one device and delivering a response to the person via another device. The process 500 may be implemented by the local endpoint devices 120(1)-(N) and server(s) 132 of FIG. 1, or by other devices. This process (along with the processes illustrated in FIGS. 6 and 7) is illustrated as a collection of blocks or actions in a logical flow graph. Some of the blocks represent operations that can be implemented in hardware, software, or a combination thereof. In the context of software, the blocks represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processors, perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described blocks can be combined in any order or in parallel to implement the processes.

For purposes of describing one example implementation, the blocks are arranged visually in FIG. 5 in columns beneath the endpoint devices 120(1)-(N) and server(s) 132 to illustrate that these devices of the system 100 may perform these operations. That is, actions defined by blocks arranged beneath the devices 120(1)-(N) may be performed by any one of the devices. In certain situations, part of the process, such as the request input part, may be performed by a first endpoint device and another part of the process, such as the response delivery part, may be performed by a second endpoint device, as illustrated by the dashed boxes about portions of the flow diagram. Similarly, actions defined by blocks arranged beneath the server(s) 132 may be performed by one or more server(s) 132.

At 502, a first local endpoint device 120(1) receives speech input at the microphone(s) 208/434. The speech input may include a wake word to alert the device to intentional speech, or may be part of an ongoing discussion after the device is already awake and interacting with the user. The speech input includes a request.

At 504, the speech recognition module 410 at the first local endpoint device 120(1) attempts to discern whether the request in the speech input would benefit from knowing the identity of the person. Said another way, is the request general or more personal? If it is not personal (i.e., the “no” branch from 504) and person identity is not beneficial, the process 500 may proceed to some pre-processing of the speech input at 508. For instance, the speech input may be a question, “What is the weather today?” This request may be considered general in nature, and not personal, and hence the system need not remember who is making the request. On the other hand, the user may make a personal request (i.e., the “yes” branch from 504) where person identity is beneficial, leading to an operation to identify the person at 506. For instance, suppose the speech input is “please remind me to take out the garbage tomorrow morning” or “remind me to pick up my wife's anniversary present.” Both of these are examples of personal requests, with the latter having a higher degree of sensitivity in how the reminder is conveyed. In these situations, the person is identified through use of voice identification (e.g., person A is talking), interchange context (male voice asks to take out garbage while in master bedroom), secondary visual confirmation, and so forth.
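One way to picture the branch at 504 is a heuristic that flags requests containing first-person references or reminder phrasing as personal. The keyword test below is purely an illustrative stand-in; an actual implementation would rely on the natural language processing described elsewhere in this disclosure.

```python
PERSONAL_MARKERS = ("remind me", " my ", " me ", " i ")

def needs_person_identity(request_text):
    """Heuristic for the 504 branch: does the request benefit from knowing
    who is speaking? The keyword matching here is purely illustrative."""
    padded = f" {request_text.lower()} "
    return any(marker in padded for marker in PERSONAL_MARKERS)

print(needs_person_identity("What is the weather today?"))                # False -> general
print(needs_person_identity("Please remind me to take out the garbage"))  # True  -> personal
```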

At 508, the first device 120(1) may optionally pre-process the speech input prior to sending it to the server. For instance, the device may apply natural language processing to the input, or compression algorithms to compress the data prior to sending it over to the servers 132, or even encryption algorithms to encrypt the audio data.

At 510, the speech input is passed to the servers 132 along with an identity of the first device 120(1) and an identity of the person, if known from 506. The identity of the device 120(1) may be a serial number, a registration number or the like, and is provided so that the task handler operating at the servers 132 knows from where the user request originated. In some cases, a response may be immediately returned to the first device 120(1), such as a response containing the current weather information. In some cases, the identity of the first device 120(1) may help confirm the identity of the user. Further, the user's use of the first device to make a particular request at a particular time of day may be recorded in the user's profile as a way to track habits or patterns in the user's normal course of the day. Further, when the person identity is associated with the first device 120(1), this association may be used in selecting a location and endpoint device for delivery of responses to that identified user for a period of time shortly after receipt of the request, or for delivery of future responses. It is also noted that in some implementations, the identity of the person may be determined by the servers 132, rather than at the first device 120(1). In such implementations, the first device 120(1) passes audio data representative of the speech input from the person, and the servers 132 use the audio data and possibly other indicia to identify the person.
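The information passed at 510 can be thought of as a small payload bundling the captured audio with the originating device identity and, when known, the person identity. The field names and wire format below are hypothetical; the disclosure does not define one.

```python
import json
import time

def build_request_payload(device_id, audio_bytes, person_id=None):
    """Assemble the data sent from the first device 120(1) to the servers 132.
    All keys are illustrative assumptions, not a defined interface."""
    return {
        "device_id": device_id,            # e.g., serial or registration number
        "person_id": person_id,            # None if identity is resolved server-side
        "timestamp": time.time(),          # useful for tracking habits/patterns
        "audio": audio_bytes.hex(),        # stand-in for the encoded audio data
    }

payload = build_request_payload("device-120-1", b"\x00\x01\x02", person_id="user-104")
print(json.dumps(payload, indent=2))
```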

It is further noted that in some implementations, the user may set a reminder for another person. For instance, a first user (e.g., the husband Scott) may make a request for a second user (e.g., his wife, Elyn), such as “Please remind Elyn to pick up the prescription tomorrow afternoon”. In this situation, the request includes an identity of another user, and the servers at the cloud services determine who that might be based on the user profile data.

At 512, the servers 132 at the cloud services 130 process the speech input received from the first endpoint device 120(1). In one implementation, the processing may include decryption, decompression, and speech recognition. Once the audio data is parsed and understood, the task handler 220 determines an appropriate response. The task handler may consult any number of applications to generate the response. For instance, if the request is for a reminder to purchase airline tickets tomorrow, the task handler may involve a travel application as part of the solution of discovering airline prices when providing the reminder response tomorrow. In addition, the cloud services 130 may also determine for whom the response is to be directed. The response is likely to be returned to the original requester, but in some cases, it can be delivered to another person (in which case the location determination would be with respect to the second person).

At 514, an immediate confirmation may optionally be sent to indicate to the user that the request was received and will be handled. For instance, in response to a request for a reminder, the response might be “Okay Scott, I'll remind you.” The servers 132 return the confirmation to the same endpoint device 120(1) from which the request was received. At 516, the first device 120(1) receives and plays the confirmation so that the user experience is one of a conversation, where the computing system heard the request and acknowledged it.

At 518, it is determined when to reply with a response. In one implementation, the task handler 220 discerns from the request an appropriate time to respond to the request. The user may use any number of ways to convey a desired answer. For instance, the user may ask for a reminder “before my company meeting” or “tomorrow morning” or at 5:00 PM on a date certain. Each of these has a different level of specificity. The latter is straightforward, with the task handler 220 setting a response for 5:00 PM. With respect to the two former examples, the task handler 220 may attempt to discern what “tomorrow morning” may be depending upon the request. If the request is for a reminder to “take out the garbage”, the timeframe associated with “tomorrow morning” is likely the time when the user is expected to be home in the morning (e.g., say at 6:30 AM as discussed above). If the request is for a reminder to “meet with marketing”, the timeframe for “tomorrow morning” is more likely to be 9:00 AM or 10:00 AM. Finally, if the request is for “before my company meeting”, the task handler 220 may consult a calendar to see when the “company meeting” is scheduled and will set a reminder for a reasonable time period before that meeting is scheduled to start.
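The reasoning at 518 can be sketched as a resolver that maps the user's phrasing to a concrete delivery time, falling back from explicit clock times, to calendar lookups, to profile-based defaults. The calendar and profile inputs below are plain dictionaries invented for illustration, not interfaces from the disclosure, and the 30-minute margin is an assumed value.

```python
from datetime import datetime, timedelta

def resolve_delivery_time(phrase, now, calendar=None, profile=None):
    """Map a requested timeframe to a concrete datetime (illustrative only).

    phrase: e.g., "tomorrow morning", "before company meeting", or "17:00".
    calendar: optional dict mapping event names to datetimes.
    profile: optional dict of habitual times, e.g., {"home_morning": 6.5}.
    """
    calendar = calendar or {}
    profile = profile or {}
    phrase = phrase.lower().strip()

    # Explicit clock time such as "17:00" (assumed here to mean tomorrow).
    if ":" in phrase and phrase.replace(":", "").isdigit():
        hour, minute = (int(p) for p in phrase.split(":"))
        return (now + timedelta(days=1)).replace(hour=hour, minute=minute,
                                                 second=0, microsecond=0)

    # "before <event>": consult the calendar and back off a reasonable margin.
    if phrase.startswith("before "):
        event = phrase[len("before "):]
        if event in calendar:
            return calendar[event] - timedelta(minutes=30)

    # "tomorrow morning": fall back to a habitual at-home time from the profile.
    if phrase == "tomorrow morning":
        hour = profile.get("home_morning", 9.0)   # mid-morning if no habit known
        base = now + timedelta(days=1)
        return base.replace(hour=int(hour), minute=int((hour % 1) * 60),
                            second=0, microsecond=0)

    return now + timedelta(hours=1)   # last-resort default

now = datetime(2012, 12, 14, 21, 30)
print(resolve_delivery_time("tomorrow morning", now, profile={"home_morning": 6.5}))
# 2012-12-15 06:30:00, matching the garbage-reminder example above
```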

At 520, a location of the target person is determined in order to identify the place to which the response is to be timely sent. For instance, as the time for response approaches, the person location module 222 determines where the user may be located in order to deliver a timely response. There are many ways to make this determination. A more detailed discussion of this action is described below with reference to FIG. 6. Further, the target user may be the initial requester or another person.

At 522, a device to which to send the response is determined. In one implementation, an endpoint device selector 310 evaluates possible devices that might be available and then determines which endpoint device might be best in the circumstances to send the response. There are many techniques for evaluating possible devices and discerning the best fit. A more detailed discussion of this action is provided below with reference to FIG. 7.

At 524, an appropriate response is timely sent to the best-fit device at the location of the target user. Suppose, for discussion purposes, the best-fit device is a different endpoint device, such as a second local device 120(2), than the device 120(1) from which the request was received.

At 526, the response is received and played (or otherwise manifested) for the target user. As shown in FIG. 5, the second device 120(2) receives the response, and plays it for the user who is believed to be in the vicinity. The response may be in any form (e.g., audio, visual, haptic, etc.) and may include essentially any type of message, reminder, etc. The response may be in an audio form, where it is played out through the speaker for the user to hear. With the continuing examples, the response may be “Don't forget to take out the garbage”, or “You have your company meeting in 15 minutes”.

The technique described above and illustrated in FIG. 5 is merely an example and implementations are not limited to this technique. Rather, other techniques for operating the devices 120 and servers 132 may be employed and the implementations of the system disclosed herein are not limited to any particular technique.

FIG. 6 shows a more detailed process for determining a location of the person, from act 520 of FIG. 5. At 602, an identity of the target person is received. As noted above with respect to act 506, certain requests will include an identity of the person making the request, such as a unique user ID.

At 604, possible locations of the target person are determined. There are many ways to make this determination, several of which are presented as representative examples. For instance, at 604-1, the person location module 222 might poll optical devices throughout an environment to attempt to visually locate the target person. The optical devices, such as cameras, may employ recognition software (e.g., facial recognition, feature recognition, etc.) to identify users. As used herein, “polling” refers to obtaining the optical information from the optical devices, which may involve actively requesting the information (e.g., a “pull” model) or receiving the information without request (e.g., a “push” model). In another approach, at 604-2, the person location module 222 may poll audio devices throughout the environment to gain voice confirmation that the target person is present. Audio tools may be used to evaluate audio input against pre-recorded vocal profiles to uniquely identify different people.

Another technique is to locate portable devices that may be associated with the target person, at 604-3. For instance, the person location module 222 may interact with location software modules that locate devices such as smartphones, tablets, or personal digital assistants via GPS data and/or cell tower trilateration data. In some implementations, this technique may be used in cooperation with other approaches. For instance, this physical location data may help narrow a search for a person to a particular residence or office, and then polling audio or optical devices may be used to place the user in particular rooms or areas of the residence or office.

The person location module 222 may further consult with other applications in an effort to locate the user, such as a calendar application, at 604-4. The calendar application may specify where the user is scheduled to be located at a particular time. This is particularly useful when the user is in various meetings at the office. There are many other sources that may be consulted to provide other indicia of the target person's whereabouts, as represented by 604-N.

Suppose the person location module 222 identifies multiple possible locations. At 606, the possible locations may be optionally ranked. For instance, each location may be assigned a confidence score indicating how likely the user is to be located there. Use of visual data may have a very high confidence score, whereas audio data has slightly less confidence associated with it. Use of a calendar item may have a significantly lower confidence score attached as there is no guarantee that the user is following the schedule.
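The ranking at 606 can be sketched by attaching a confidence score to each candidate location based on the kind of evidence that produced it, then sorting. The specific weights below are invented for illustration and are not values from the disclosure.

```python
# Illustrative confidence weights per evidence type (assumed values).
EVIDENCE_CONFIDENCE = {
    "visual": 0.95,     # camera-based recognition
    "audio": 0.80,      # voice captured by a nearby device
    "device": 0.60,     # personal device location (GPS, cell, Wi-Fi)
    "calendar": 0.35,   # scheduled, but no guarantee the user follows it
}

def rank_locations(observations):
    """observations: list of (location, evidence_type) tuples.
    Returns (location, confidence) pairs sorted from most to least confident."""
    scored = [(EVIDENCE_CONFIDENCE.get(kind, 0.1), where) for where, kind in observations]
    scored.sort(reverse=True)
    return [(where, score) for score, where in scored]

print(rank_locations([("kitchen", "audio"), ("office", "calendar"), ("kitchen", "visual")]))
# [('kitchen', 0.95), ('kitchen', 0.8), ('office', 0.35)]
```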

At 608, the person location module 222 may engage one or more local devices to interact with the target person to confirm his or her presence. For instance, suppose the person location module 222 initially believes the person is in a particular room. The person location module 222 may direct one of the devices in the room to engage the person, perhaps through asking a question (e.g., “Scott, do you need anything?”). If the person is present, the person may naturally respond (e.g., “No, nothing. Thanks”). The person location module 222 may then confirm that the target person is present.

At 610, a location is chosen for delivery of the response to the user. The choice may be based on the ranked possible locations of action 606 and/or on confirmation through a quick interaction of action 608.

FIG. 7 shows a more detailed process for determining an appropriate device to return the response, from action 522 of FIG. 5.

At 702, the location of the target person is received. This may be determined from the action 520, as illustrated in FIG. 6. Alternatively, the location of the target person may be pre-known or the user may have informed the system of where he or she was located.

At 704, possible devices proximal to the location of the target person are discovered as being available to deliver the response to the person. For example, if the user is found to be located in a room of a home or office, the endpoint device selector 310 discovers whether one or more devices reside in the room of the house. The selector 310 may consult the user's profile to see what devices are associated with the user, or may evaluate registration records that identify a residence or location in which the device is installed.

At 706, the available devices are evaluated to ascertain which might be the best device in the circumstances to return a response to the target person. There are many approaches to make this determination, several of which are presented as representative examples. For instance, at 706-1, a distance from the endpoint device to the target person may be analyzed. If the endpoint device is equipped with depth sensors (e.g., time of flight sensors), the depth value may be used. If multiple devices are in a room, the timing difference of receiving verbal input from a user among the devices may be used to estimate the location of the person and which device might be closest.
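As one way to illustrate the timing-difference idea at 706-1, the sketch below picks the device that detected the same utterance earliest, on the assumption that sound reaches the nearest microphone first. The device identifiers and timestamps are hypothetical, and this is only a rough approximation rather than a full time-difference-of-arrival solution.

```python
def closest_device_by_arrival(arrival_times):
    """arrival_times: dict mapping device id to the time (in seconds) at which
    the same utterance was first detected. The earliest arrival is taken as
    the closest device; an illustrative approximation only."""
    return min(arrival_times, key=arrival_times.get)

arrivals = {"device-302": 0.0140, "device-306": 0.0082, "device-308": 0.0075}
print(closest_device_by_arrival(arrivals))   # device-308, matching the FIG. 3 example
```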

At 706-2, the background volume in an environment containing the target person may be analyzed. High background volume may impact the ability of the device to communicate with the target user. For instance, suppose a room has a first device located near an appliance and a second device located across the room. If the appliance is operating, the background volume for the first device may be much greater than the background volume for the second device, thereby suggesting that the second device might be more appropriate in this case to communicate with the user.

At 706-3, the signal-to-noise ratios (SNRs) of various available devices are analyzed. Devices with strong SNRs are given a preference over those with weaker SNRs.

At 706-4, echo characteristics of the environment may be analyzed. A baseline reading is taken when the room is empty of humans and moving objects to get an acoustical map of the surrounding environment, including location of surfaces and other objects that might cause sound echo. The echo characteristics may be measured at the time of engagement with humans, including the target user, to determine whether people or objects might change the acoustical map. Depending upon the outcome of these measurements, certain available devices may become more appropriate for delivering the response to the target user.
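A minimal sketch of the echo comparison of 706-4 follows, assuming each device can produce a reverberation-time estimate both for the empty-room baseline and at the time of engagement. The metric and helper names are illustrative assumptions.

```python
def echo_deviation(baseline_rt60_s, measured_rt60_s):
    """Absolute change in reverberation time relative to the empty-room baseline."""
    return abs(measured_rt60_s - baseline_rt60_s)

def rank_by_echo_stability(baselines, measurements):
    """Prefer devices whose local acoustics changed least from the baseline map.

    baselines / measurements: device_id -> reverberation-time estimate (seconds).
    """
    return sorted(
        baselines,
        key=lambda dev: echo_deviation(baselines[dev], measurements[dev]),
    )

print(rank_by_echo_stability(
    baselines={"ep-01": 0.45, "ep-02": 0.50},
    measurements={"ep-01": 0.47, "ep-02": 0.80},  # ep-02's surroundings changed
))
```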

At 706-5, Doppler characteristics of the environment, particularly with respect to the target user's movement through the environment, may be analyzed. In some cases, a user may be moving through an environment from one part of a room to another part of the room, or from room to room. In these cases, if the user is also speaking and conversing with the computing system 100, there may be changing acoustics that affect which devices are the best to interact with the user, depending upon the direction of the user's movement and the orientation of the user's head when speaking. The Doppler characteristics may therefore impact which device may be best for responding in a given set of circumstances.
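For reference, the classical Doppler relation for a moving talker and a stationary device is sketched below. The description does not prescribe this exact computation, so the helper is only an illustrative assumption of how the direction of movement could be inferred from a frequency shift.

```python
SPEED_OF_SOUND_M_PER_S = 343.0

def observed_frequency(source_hz, radial_speed_m_per_s):
    """Frequency observed at a stationary device for a moving talker.

    radial_speed_m_per_s is positive when the talker moves toward the device.
    """
    return source_hz * SPEED_OF_SOUND_M_PER_S / (
        SPEED_OF_SOUND_M_PER_S - radial_speed_m_per_s
    )

# A talker walking toward one device (+1.4 m/s) and away from another (-1.4 m/s)
# produces opposite shifts; the sign of the shift hints at the direction of motion.
print(observed_frequency(200.0, 1.4))   # slightly above 200 Hz
print(observed_frequency(200.0, -1.4))  # slightly below 200 Hz
```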

At 706-6, the environment may be analyzed, such as how many people are in the room, or who in particular is in the room, and so forth. In some implementations, visual data received from cameras or other optical devices may provide insights as to the number of people, or the identities of people, in the environment. This analysis may assist in determining which device is most appropriate to deliver a response. For instance, if a device is located in a room crowded with people, the system may determine that another device away from the crowd is better suited.

There are many other types of analyses applied to evaluate possible devices for providing the response, as represented by 706-M. For instance, another type of analysis is to review ownership or registration information to discover an association between the target user and personal devices. Devices that are more personal to the target user may receive a higher score.

At 708, the response is evaluated to determine whether there are any special criteria that might impact a decision of where to direct the response. For instance, in the scenario where the user asked for a reminder to pick up his wife's present, the response will include an element of privacy or sensitivity in that the system should not return a reminder to a location where the target person's wife may accidentally hear the reminder. Another example is where the user may be requesting information about a doctor appointment or personal financial data, which is not intended for general consumption. There are myriad examples of special criteria. Accordingly, at 708, these criteria are evaluated and used in the decision-making process of finding the best endpoint device under the circumstances.
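The privacy screening of action 708 might look like the following sketch, which simply removes candidate devices located in rooms believed to contain listeners the response must be kept from. The occupancy map and parameter names are hypothetical.

```python
def filter_for_privacy(candidate_devices, room_occupants, sensitive, excluded_listeners):
    """Drop devices whose rooms contain listeners the response must be kept from.

    candidate_devices: iterable of (device_id, room) pairs.
    room_occupants: room -> set of people currently believed to be in that room.
    sensitive: whether the response carries privacy-sensitive content.
    excluded_listeners: people who must not overhear the response.
    """
    if not sensitive:
        return list(candidate_devices)
    return [
        (dev, room)
        for dev, room in candidate_devices
        if not (room_occupants.get(room, set()) & set(excluded_listeners))
    ]

devices = [("ep-01", "kitchen"), ("ep-02", "garage")]
occupants = {"kitchen": {"scott", "wife"}, "garage": {"scott"}}
print(filter_for_privacy(devices, occupants, sensitive=True, excluded_listeners=["wife"]))
```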

At 710, the best endpoint device 120 is chosen. This decision may be based on scoring the various analyses 706-1 to 706-M, ranking the results, and applying any special criteria to the results. In this example, the device with the highest score is ultimately chosen.
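One plausible way to combine the analyses of 706-1 through 706-M with the special criteria of 708, consistent with the scoring-and-ranking description above, is sketched below. The weight values and score scales are illustrative assumptions, not quantities the description defines.

```python
def choose_endpoint(per_device_scores, weights, disqualified=()):
    """Combine per-analysis scores (706-1..706-M) and pick the best device.

    per_device_scores: device_id -> {analysis_name: score in [0, 1]}.
    weights: analysis_name -> relative weight (illustrative values).
    disqualified: devices removed by special criteria such as privacy (708).
    """
    totals = {}
    for device, scores in per_device_scores.items():
        if device in disqualified:
            continue
        totals[device] = sum(
            weights.get(name, 0.0) * value for name, value in scores.items()
        )
    return max(totals, key=totals.get) if totals else None

scores = {
    "ep-01": {"distance": 0.9, "snr": 0.4, "ownership": 1.0},
    "ep-02": {"distance": 0.6, "snr": 0.9, "ownership": 1.0},
}
weights = {"distance": 0.4, "snr": 0.4, "ownership": 0.2}
print(choose_endpoint(scores, weights, disqualified=()))
```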

CONCLUSION

Although the subject matter has been described in language specific to structural features, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features described. Rather, the specific features are disclosed as illustrative forms of implementing the claims.

1.-20. (canceled)
21. A computer-implemented method, comprising: receiving, from a first device, input audio data corresponding to an input from a first user; processing the input audio data to determine a request to provide a message to a second user, the second user associated with at least a second device and a third device; receiving second data corresponding to a location of the second user; and based at least in part on the second data, causing output data corresponding to the message to be sent to the second device rather than the third device.
22. The computer-implemented method of claim 21, wherein processing the input audio data to determine the request comprises performing automatic speech recognition using the input audio data to determine text data representing the input.
23. The computer-implemented method of claim 21, further comprising: determining the second user is associated with user profile data; and identifying the second device using the user profile data.
24. The computer-implemented method of claim 21, wherein receiving the second data comprises receiving global positioning system (GPS) data.
25. The computer-implemented method of claim 21, wherein receiving the second data comprises receiving Wi-Fi data.
26. The computer-implemented method of claim 21, wherein receiving the second data comprises receiving cell tower data.
27. The computer-implemented method of claim 21, wherein the second device comprises a smartphone.
28. The computer-implemented method of claim 21, further comprising: determining user profile data corresponding to the first user; and determining the second user using the user profile data.
29. The computer-implemented method of claim 21, wherein the second data is received from the second device.
30. The computer-implemented method of claim 21, wherein the output data comprises output audio data and the method further comprises: sending the output audio data to the second device; and sending the second device a command that causes the second device to output audio corresponding to the output audio data.
31. The computer-implemented method of claim 21, further comprising: sending the second device a command that causes the second device to display a visual output corresponding to the output data.
32. A system comprising: at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor, cause the system to: receive, from a first device, input audio data corresponding to an input from a first user; process the input audio data to determine a request to provide a message to a second user, the second user associated with at least a second device and a third device; receive second data corresponding to a location of the second user; based at least in part on the second data, select the second device rather than the third device; and cause output data corresponding to the message to be sent to the second device.
33. The system of claim 32, wherein the instructions that cause the system to process the input audio data to determine the request comprise instructions that, when executed by the at least one processor, cause the system to perform automatic speech recognition using the input audio data to determine text data representing the input.
34. The system of claim 32, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine the second user is associated with user profile data; and select the second device based further at least in part on the user profile data.
35. The system of claim 32, wherein the second data comprises global positioning system (GPS) data.
36. The system of claim 32, wherein the second data comprises Wi-Fi data.
37. The system of claim 32, wherein the second data comprises cell tower data.
38. The system of claim 32, wherein the second device comprises a smartphone.
39. The system of claim 32, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine user profile data corresponding to the first user; and determine the second user using the user profile data.
40. The system of claim 32, wherein the output data comprises output audio data and wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: send the output audio data to the second device; and send the second device a command that causes the second device to output audio corresponding to the message.